Searching based on an identifier of a searcher

ABSTRACT

A query is received to search data, where the query includes a search term. A search of the data is performed in response to the query, wherein the search produces result data based on the search term and an identifier of a searcher submitting the query.

BACKGROUND

Search engines can be used to search data available at various data sources, including websites, data sources within an enterprise, and so forth. There can be a relatively large number of data items returned in a search result produced by a search engine in response to a query. These data items can be ranked in some order, based on predefined criteria.

BRIEF DESCRIPTION OF THE DRAWINGS

Some embodiments are described with respect to the following figures:

FIG. 1 is a block diagram of an example arrangement that includes a personalized search engine according to some implementations;

FIGS. 2 and 3 are flow diagrams of personalized search processes according to various implementations; and

FIG. 4 is a block diagram of an example system incorporating some implementations.

DETAILED DESCRIPTION

To improve search results returned to users in response to search queries, personalized searching can be performed. Personalized searching can refer to a search that considers search terms of a query as well as personal information associated with a searcher who submitted the query. “Improving” a search result can refer to producing a search result having data items identified by a search that is more likely to be relevant to the search term(s) of a submitted query.

Personalized searching can be accomplished by performing query expansion, where a search based on a query that contains at least one search term is expanded to also consider personal information associated with the user. In some examples, query expansion can be based on a personal profile of a user that was created ahead of time based on various information associated with the user. A personal profile can include concepts or topics that describe the interests of the user. After creation, the content of the personal profile of the user can be used in performing query expansion for personalized searching. However, creating personal profiles for users can be a relatively time-consuming or complex task. In addition, personal profiles can become out-of-date after some time has passed, since user interests may have changed.

In accordance with some implementations, rather than rely on creating personal profiles for users, personalized searching can be based on personal content associated with the user that is already part of a corpus of data that is to be searched. A “corpus of data” can refer to any collection of data, whether included in public websites, internal data sources of an enterprise (e.g., a business concern, government agency, educational organization, etc.), or other data sources. Personal content of the user can include various types of content, including, as examples, presentation slides, text documents, e-mails, conversation logs, social networking posts, and so forth.

Personal content can be considered to “involve” the corresponding user, where content is considered to “involve” a user if the content is authored by the user, is received by the user, or is produced based on participation of the user. In some examples, personal content can also be considered to “involve” the corresponding user if the content is authored by a related party, is received by the related party, or is produced based on participation of the related party, where the related party can be another user that has some predefined relationship with the target user (e.g., the target user's co-worker, the target user's family member, etc.).

Content involving a user contains, is referred to by, or is otherwise associated with an identifier of the user. Although reference is made to personalized searches for users, note that in other examples, personalized searches can be performed in response to queries submitted by other types of entities, such as applications, computers, and so forth. More generally, personalized searching according to some implementations can be performed in response to queries submitted by searchers, where a “searcher” can include a user or any other type of requesting entity.

FIG. 1 is a block diagram of an example arrangement that incorporates some implementations. FIG. 1 includes various data sources 102 (implemented with storage devices) that are coupled to a data network 104. The data sources 102 can store data items 105 that make up a corpus of data (106). The data sources 102 can be internal data sources of an enterprise, public data sources available over the Internet, or a combination of internal and public data sources. Examples of the data items 105 can include text files, image files, video files, audio files, presentation slides, e-mails, conversation logs, social networking posts, and so forth.

A server computer 108 coupled to the data network 104 has a personalized search engine 110 according to some implementations. The personalized search engine 110 is able to perform personalized searching that produces result data based on search term(s) of a query as well as an identifier of a searcher that submitted the query. The identifier of the searcher can include a name (such as the proper name) of the searcher, an address (such as a physical address or a network address, e.g., Internet Protocol address) of the searcher, a telephone number of the searcher, or any other identifier of the searcher. In further examples, the identifier can be based on some combination of the foregoing. Including a combination of multiple elements, such as a name, address, and/or telephone number of the searcher, can produce result data that contains data items that are more likely to be relevant to the searcher. The result data produced in response to the query can include selected data items 105 from the corpus of data 106.

The identifier of the searcher is obtained independently of a search input provided by the searcher to produce the query. For example, the identifier is not a search term entered by the searcher at a given client computer 112 to generate the query. Rather, the identifier of the searcher is obtained from another source, such as from login information provided by the searcher when initially logging into the given client computer 112, or from pre-stored information (e.g., cookie or other file) in the given client computer 112. Alternatively, the identifier of the searcher can be obtained by the server computer 108 from a different source. For example, upon receiving a query from the given client computer 112, the server computer 108 can access this different source (e.g., a database, a lookup table, a list, etc.) to retrieve the identifier of the searcher that is associated with the given client computer 112.
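
The sources of the identifier described above can be sketched as a simple fallback chain. The following Python sketch is only an illustration under stated assumptions and is not the implementation of the server computer 108: the function name `resolve_searcher_id`, the request field names (`login_name`, `cookie_id`, `client_address`), and the order in which the sources are consulted are all hypothetical.

```python
def resolve_searcher_id(request, client_lookup_table):
    """Return an identifier of the searcher without reading the search string.

    `request` is a hypothetical dict describing a query received from a client
    computer; `client_lookup_table` is a hypothetical server-side mapping from
    a client address to the identifier of the associated searcher.
    """
    # Source 1 (assumed field): login information supplied by the client.
    login_name = request.get("login_name")
    if login_name:
        return login_name
    # Source 2 (assumed field): pre-stored information such as a cookie.
    cookie_id = request.get("cookie_id")
    if cookie_id:
        return cookie_id
    # Source 3: a different source consulted by the server, e.g. a lookup table.
    return client_lookup_table.get(request.get("client_address"))
```

Note that the search string in the request is never inspected, which mirrors the point that the identifier is obtained independently of the search input.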

At least a portion of the result data produced in response to the query includes content that involves the searcher, where such content can include any of the various example data items mentioned above. Identifying content involving the searcher, based on the identifier of the searcher that submitted the query, allows for performance of personalized searching that uses such identified content. In some examples, the result data can include a collection of data items that match the search term(s) of the query. Within this collection, data items that involve the searcher (e.g., were authored by the searcher, were received by the searcher, or otherwise were produced based on participation of the searcher) can be ranked higher than other data items.
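
As a rough illustration of ranking data items that involve the searcher above other matching data items, consider the following Python sketch. It is a toy under stated assumptions rather than the behavior of the personalized search engine 110: the `DataItem` fields, the helper `involves`, whole-word matching on whitespace-split text, and the simple "involving items first" ordering are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class DataItem:
    # Hypothetical record layout for a data item in the corpus.
    item_id: str
    text: str
    author: str = ""
    recipients: tuple = ()
    participants: tuple = ()


def involves(item, searcher_id):
    """True if the item was authored by, received by, or produced with
    participation of the searcher identified by `searcher_id`."""
    return (
        item.author == searcher_id
        or searcher_id in item.recipients
        or searcher_id in item.participants
    )


def personalized_search(items, search_terms, searcher_id):
    """Return items matching any search term, with items that involve the
    searcher placed ahead of the other matches."""
    terms = {t.lower() for t in search_terms}
    matches = [it for it in items if terms & set(it.text.lower().split())]
    # sorted() is stable, so the original order is kept within each group;
    # False sorts before True, so involving items come first.
    return sorted(matches, key=lambda it: not involves(it, searcher_id))
```

For instance, with two articles that both mention "smartphone", the one authored by the searcher would be returned first even if it appears later in the corpus.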

Since personalized searching according to some implementations employs content that is already part of the corpus of data (106) being searched, the personalized searching is considered to be adaptive, on-line personalized searching, since the personalization automatically adapts to changes in content that involve the searcher (e.g., new content being added, old content being deleted, content being modified, etc.).

The server computer 108 of FIG. 1 also includes a second search phase query generator 111 according to some implementations. The second search phase query generator 111 is used to create a second query for use in a two-phase personalized search process (discussed further below).

FIG. 1 further depicts a number of client computers 112 connected to the data network 104, where each client computer has a user interface application 114 through which a user can submit a query to the server computer 108. Examples of the user interface application 114 include a web browser or any other type of application that presents a user interface in which a user can enter search terms.

FIG. 2 is a flow diagram of a personalized search process 200 according to some implementations. The process 200 can be performed by the personalized search engine 110 of FIG. 1, for example. The process 200 receives (at 202) a query (such as from a client computer 112) to search data in the corpus of data 106, where the query includes a search term (or multiple search terms). In response to the query, the process 200 performs (at 204) a search of the corpus of data 106. The search produces result data based on the search term (or search terms) in the query, and further based on an identifier of the searcher that submitted the query. At least a portion of the result data includes content that involves the searcher identified by the identifier.

Query expansion is performed in the personalized search process 200 of FIG. 2 by considering information of the searcher (the identifier of the searcher) that is not part of the search term(s) contained in the received query.

In further implementations, query expansion can further include performance of another search phase that uses the result data produced by the search depicted in FIG. 2. In such further implementations, the tasks of FIG. 2 are considered to be part of a first search phase. A second search phase performs a further search based on search terms derived from the result data of the first search phase.

In some implementations, query expansion based on the two-phase searching noted above can be referred to as query expansion based on relevance feedback. Relevance feedback is based on the following concept. A search engine responds to a first query by providing a ranked list of result data items. This list of result data items can then be analyzed to create a second query, which contains search terms that are different from the search terms of the first query. One type of relevance feedback is pseudo relevance feedback, which produces search terms for the second query by using the top-ranked result data items from the first query, where “top-ranked” result data items can refer to some predefined number of data items in the result data that are considered to be the most relevant according to at least one ranking criterion.

FIG. 3 is a flow diagram of a two-phase personalized search process 300 according to some implementations (which uses query expansion based on relevance feedback). The process 300 has two search phases 308 and 310, where the first search phase 308 includes tasks 302, 304, and 306, and the second search phase 310 includes tasks 312, 314, 316, 318, and 320. The process 300 can be performed by the personalized search engine 110 and second search phase query generator 111 of FIG. 1.

The personalized search engine 110 receives (at 302) a first query, which is based on input from the searcher. For example, the searcher may have entered an input search string (containing one or multiple search terms) into a user interface application 114 (FIG. 1) that displays a user interface associated with a particular search engine (e.g., 110 in FIG. 1). In response to the entered search string, the corresponding client computer 112 produces a query that is sent to the personalized search engine 110 of the server computer 108 depicted in FIG. 1.

In response to the received first query, the personalized search engine 110 performs (at 304) a search based on the search term(s) of the first query, and further based on the identifier of the searcher. As noted above, the identifier of the searcher is obtained independently of the input search string provided by the searcher. Rather, the identifier of the searcher is obtained from another source (different from the input search string), such as from login information provided by the searcher when initially logging into a given client computer 112, or from pre-stored information (e.g., cookie or other file) in the given client computer 112. This identifier obtained from the source can be provided by the given client computer 112 to the server computer 108 with the search query. Alternatively, the identifier of the searcher can be obtained by the server computer 108 from a different source. For example, upon receiving a query from the given client computer 112, the server computer 108 can access this different source (e.g., a database, a lookup table, a list, etc.) to retrieve the identifier of the searcher that is associated with the given client computer 112.

The personalized search engine 110 according to some examples can be an inverted index search engine. An inverted index can refer to an index data structure that stores a mapping from content (such as words, numbers, or other terms) to locations in a corpus of data (e.g., 106 in FIG. 1). The inverted index search engine, upon receiving a query containing one or multiple search terms, accesses the inverted index based on the search term(s) of the query to identify locations of data items that contain the search term(s). An example of an inverted index search engine is a Lucene™ search engine from the Apache Software Foundation. Other examples of inverted index search engines can be used in other implementations.
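
To make the inverted index concrete, the following Python sketch builds a mapping from terms to the identifiers of data items that contain them and then answers a query against that mapping. It is a minimal toy and does not reflect how Lucene or the personalized search engine 110 is implemented; the function names, whitespace tokenization, and the choice to require all search terms (AND semantics) are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """Map each term to the set of data item identifiers containing it.

    `corpus` is a dict from a data item identifier to the item's text.
    """
    index = defaultdict(set)
    for item_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(item_id)
    return index


def lookup(index, search_terms):
    """Return identifiers of data items that contain every search term."""
    # index.get avoids creating empty entries for unknown terms.
    postings = [index.get(term.lower(), set()) for term in search_terms]
    return set.intersection(*postings) if postings else set()
```

The lookup cost depends on the sizes of the posting sets for the query terms rather than on the size of the whole corpus, which is the usual motivation for this data structure.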

The search engine 110 produces (at 306) a list of weighted data items in the result data for the first query. A data item having a higher rank is assigned a higher weight. The list of weighted data items can be in the following form: $(ID_{1}, w_{1}), (ID_{2}, w_{2}), \ldots, (ID_{n}, w_{n})$, where n (which can be greater than or equal to one) represents a number of data items in the result data. In the foregoing, $ID_{i}$ represents an identifier of a data item identified by the search 304 in FIG. 3, and $w_{i}$ represents the weight assigned to the data item identified by $ID_{i}$. A data item in the result data produced in response to the search 304 of FIG. 3 can have a corresponding rank assigned to the data item based on at least one criterion that relates to relevance of the data item to the search. This rank can be used for producing the weight $w_{i}$.

The list of weighted data items is ordered according to the weights, such as in a descending order (or other order). More generally, the process of FIG. 3 can produce (at 306) a data structure that contains information relating to data items identified by the search 304, where the information can include identifiers of the data items (or alternatively, the data items themselves), as well as indications of rankings associated with the respective data items.
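
The list of (identifier, weight) pairs can be sketched as follows. The text does not specify how a rank is converted into a weight, so the reciprocal-rank rule used here (weight = 1/rank) is purely an assumption for illustration, as is the function name `weighted_list`.

```python
def weighted_list(ranked_ids):
    """Turn identifiers ordered from most to least relevant into
    (identifier, weight) pairs, in descending order of weight.

    Assumption: weight = 1 / rank, so a higher-ranked item gets a
    higher weight. Any monotonically decreasing rule would do.
    """
    pairs = [
        (item_id, 1.0 / rank)
        for rank, item_id in enumerate(ranked_ids, start=1)
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```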

In the second search phase 310 of the process 300, the second search phase query creator 111 next identifies (at 312) terms within the weighted data items produced in the first search phase 308, and more particularly, within at least a subset of the weighted data items. The subset can include the top-m data items according to the assigned weights $w_{i}$, where m can be some predefined number greater than or equal to 1. Pre-processing can be applied to the various terms in the weighted data items to remove certain words that are unlikely to aid in returning relevant results. Pre-processing can omit stop words, which are frequently-occurring words, such as “that,” “then,” “when,” etc. Pre-processing can also involve stemming, in which words are converted to their stem (which refers to the base or root form of the word). For example, the stem for “transferring” is “transfer,” the stem for “ideas” is “idea,” and so forth.
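
The pre-processing described above can be sketched in Python as follows. This is a deliberately naive illustration: the stop-word list is a small hypothetical sample, and `naive_stem` is a toy suffix stripper written only to reproduce the two examples in the text. A real system would more likely use an established stemmer (for example, a Porter-style stemmer), which is not shown here.

```python
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = frozenset({"that", "then", "when", "the", "a", "of"})


def naive_stem(word):
    """Toy stemmer: strips '-ing' (undoing a doubled final letter) or a
    plural '-s'. Not a substitute for a real stemming algorithm."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # "transferr" -> "transfer"
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def preprocess(text):
    """Lowercase, strip surrounding punctuation, drop stop words, and stem."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]
```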

For each term j in at least the subset of the weighted data items produced in the first search phase 308, a correlation weight is computed (at 314) by the query creator 111, where the correlation weight is based on correlation between the term j and the subset of weighted data items. In some examples, the correlation weight can be computed as follows:

$$\frac{\sum_{i}\left(w_{i} - W\right)\left(\delta_{ij} - DF_{j}\right)}{\sqrt{\sum_{i}\left(w_{i} - W\right)^{2}}\,\sqrt{\sum_{i}\left(\delta_{ij} - DF_{j}\right)^{2}}}, \qquad \text{(Eq. 1)}$$

where i is iterated through the subset of weighted data items, $W$ is a constant (e.g., the mean weight assigned to a predefined first number of the weighted data items), $\delta_{ij}$ is an indicator indicating whether the term j appears in the i-th data item, and $DF_{j}$ is the frequency of the term j in the overall corpus of data (e.g., 106 in FIG. 1). In some examples, $\delta_{ij}$ can have the value zero if the term j does not appear in the i-th data item, and can have a predefined non-zero value if the term j appears in the i-th data item.

Note that if a term j does not appear in the i-th data item, then the value $(\delta_{ij} - DF_{j})$ in Eq. 1 can be negative, since $\delta_{ij}$ is equal to zero. On the other hand, if the term j appears in the i-th data item, then $\delta_{ij}$ is equal to a non-zero value, and the value $(\delta_{ij} - DF_{j})$ can be a positive value. Note also that if a weight $w_{i}$ is less than the value of $W$ (which can be the mean weight of a first number of the weighted data items), then the value of $(w_{i} - W)$ can also be a negative value.

A higher value of the correlation weight computed according to Eq. 1 for a term j indicates that the term j has a higher frequency of occurrence in higher weighted data items as compared to the frequency of occurrence in the overall corpus of data (106).
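
A direct transcription of Eq. 1 into Python may help in following the sign behavior discussed above. The sketch below rests on several assumptions that the text leaves open: W defaults to the mean of the supplied weights, the indicator takes the values 0 and 1, the corpus frequency of the term is passed in as a fraction between 0 and 1, and a zero denominator yields 0.0. The function and parameter names are hypothetical.

```python
import math


def correlation_weight(weights, indicators, doc_freq, mean_weight=None):
    """Evaluate Eq. 1 for a single term j.

    weights:     w_i for each data item in the subset
    indicators:  delta_ij for each data item (assumed 1 if term j appears, else 0)
    doc_freq:    DF_j, assumed to be the fraction of the corpus containing term j
    mean_weight: W; assumed to default to the mean of `weights`
    """
    if mean_weight is None:
        mean_weight = sum(weights) / len(weights)
    numerator = sum(
        (w - mean_weight) * (d - doc_freq)
        for w, d in zip(weights, indicators)
    )
    weight_norm = math.sqrt(sum((w - mean_weight) ** 2 for w in weights))
    term_norm = math.sqrt(sum((d - doc_freq) ** 2 for d in indicators))
    if weight_norm == 0 or term_norm == 0:
        return 0.0  # assumption: treat a degenerate case as "no correlation"
    return numerator / (weight_norm * term_norm)
```

With weights 3, 2, 1 and a corpus frequency of 0.5, a term that appears in the two highest-weighted items gets a positive value, while a term that appears only in the two lowest-weighted items gets a negative value of the same magnitude.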

In other examples, other types of correlation weights can be computed to correlate each term j with data items in at least the subset of the weighted data items produced in the first search phase 308.

The query creator 111 next identifies (at 316) at least a subset of the terms j, where the identified subset can be the terms associated with a predefined top number of terms according to the correlation weights calculated according to Eq. 1. Generally, the identification of the subset of the terms j is based on ranking of the terms, which according to some examples uses the correlation weights according to Eq. 1.

For example, the top-most correlated terms identified based on the correlation weights (some predefined top number of terms associated with the highest correlation weights) can be used as the search terms for the second query, which can be submitted to the search engine 110. The query creator 111 then produces (at 318) a second query that includes the identified subset of the terms. Note that the correlation weights can be submitted with the search terms in the second query to the search engine 110. The search engine 110 then performs (at 320) another search based on the second query. The search results produced in response to the second query can then be provided back to the client computer 112 that submitted the first query.
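
Selecting the top-most correlated terms and packaging them, together with their correlation weights, as a second query can be sketched as below. The representation of the second query as a list of (term, weight) pairs and the name `build_second_query` are assumptions for illustration; the text does not prescribe a query format.

```python
def build_second_query(correlation_weights, top_k):
    """Return the `top_k` terms with the highest correlation weights.

    `correlation_weights` maps each candidate term to its correlation
    weight. The result is a list of (term, weight) pairs, highest first,
    so the weights can accompany the search terms in the second query.
    """
    ranked = sorted(
        correlation_weights.items(),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]
```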

An example of the foregoing process is provided below. A news reporter may desire to search a corpus of data that includes news articles. The news reporter enters a search term, such as “smartphone,” at a client computer 112 (FIG. 1) to find news articles relating to “smartphone.” In response, a first query is produced that is submitted to the personalized search engine 110 (FIG. 1). In response to the first query, the search engine 110 performs a search based on the search term “smartphone,” and further based on the identifier (e.g., name) of the news reporter. This search produces a list of news articles that are relevant to “smartphone,” some of which may have been authored by the news reporter. The news articles authored by the news reporter may be ranked higher in the list of news articles. The news articles in the list are associated with respective weights.

The foregoing tasks are part of the first search phase 308 depicted in FIG. 3. In the second search phase 310, terms in at least a subset of the weighted news articles are identified, and correlation weights are computed for each of the identified terms as discussed in connection with FIG. 3. Based on the correlation weights, a subset of the terms are identified and included in a second query. The second search phase 310 then processes the second query to again search the corpus of news articles. The news articles found in response to the second query are returned to the news reporter who requested the search.

In the foregoing, it is assumed that the first query submitted to the search engine 110 includes a search string entered by the searcher. In other implementations, similar techniques can be applied in the context of determining data item similarity (such as document similarity). Data item similarity refers to identifying at least one result data item that is similar to an input data item. To perform data item similarity determination, the input data item is parsed to identify terms derived from the content of the input data item. Pre-processing can be performed on the terms of the input data item, such as to omit stop words and to perform stemming. Selection of terms can consider the tf-idf (where tf stands for term frequency and idf stands for inverse document frequency over the entire corpus) of each term to decide the probability of the term's importance for the input data item. In other examples, other techniques for selecting terms for performing data item similarity computation can be performed.
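
The tf-idf based selection of terms from an input data item can be sketched as follows. The text does not give a specific tf-idf formula, so the variant used here (term frequency normalized by the length of the input, and a smoothed inverse document frequency of ln((1 + N) / (1 + df)) + 1) is an assumption, as are the function name and the tie-breaking by alphabetical order.

```python
import math
from collections import Counter


def top_tfidf_terms(input_terms, corpus_term_sets, top_k):
    """Pick the `top_k` terms of an input data item by tf-idf score.

    input_terms:      pre-processed terms of the input data item (non-empty)
    corpus_term_sets: one set of terms per data item in the corpus
    """
    term_counts = Counter(input_terms)
    n_items = len(corpus_term_sets)
    scores = {}
    for term, count in term_counts.items():
        doc_count = sum(1 for terms in corpus_term_sets if term in terms)
        tf = count / len(input_terms)
        # Smoothed idf (an assumed variant) avoids division by zero for
        # terms that do not occur anywhere in the corpus.
        idf = math.log((1 + n_items) / (1 + doc_count)) + 1.0
        scores[term] = tf * idf
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
```

The selected terms would then form the query that is handed to the personalized search process, in place of a search string typed by the searcher.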

Once the terms are extracted, the personalized search process of FIG. 2 or 3 can be performed, in accordance with some implementations, to identify result data items that are similar to the input data item.

In an example, a searcher may desire to find news articles in a corpus that are similar to an input news article about the birth of an elephant in a zoo in Los Angeles. To perform such a search, terms from the input news article are extracted, and a query containing the terms is produced. The process of FIG. 2 or 3 can then be performed in response to the query.

FIG. 4 is a block diagram of an example system 400, which can be an example implementation of the server computer 108 of FIG. 1. The system 400 can include machine-readable instructions 402 that can include instructions corresponding to the personalized search engine 110 and/or second query creator 111 of FIG. 1. The machine-readable instructions 402 are executable on one or multiple processors 404, which can be coupled to a network interface 406 (to allow the system 400 to communicate over a data network) and to a storage medium (or storage media) 408 (to store data). A processor can include a microprocessor, microcontroller, processor module or subsystem, programmable integrated circuit, programmable gate array, or another control or computing device.

The storage medium (or storage media) 408 can be implemented as one or multiple computer-readable or machine-readable storage media. The storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; optical media such as compact disks (CDs) or digital video disks (DVDs); or other types of storage devices. Note that the instructions discussed above can be provided on one computer-readable or machine-readable storage medium, or alternatively, can be provided on multiple computer-readable or machine-readable storage media distributed in a large system having possibly plural nodes. Such computer-readable or machine-readable storage medium or media is (are) considered to be part of an article (or article of manufacture). An article or article of manufacture can refer to any manufactured single component or multiple components. The storage medium or media can be located either in the machine running the machine-readable instructions, or located at a remote site from which machine-readable instructions can be downloaded over a network for execution.

In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some or all of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.

What is claimed is:
1. A method of multi-phased personalized searching, comprising: performing by a computer system: performing a search of data in response to a first query and an identifier of a searcher, wherein the search produces result data based on a search term in the first query and the identifier of the searcher; selecting a first subset of terms in data items of a subset of data items in the result data; computing correlation weights for corresponding terms in the first subset of terms, where each of the correlation weights is based on correlation of the corresponding term to data items in the subset of data items; selecting a second subset of the terms in the first subset of terms according to the correlation weights; producing a second query containing the second subset of terms; and processing the second query to perform further searching that retrieves data items matching the second subset of terms from public websites available over an Internet, wherein the multi-phase personalized searching includes: a first search phase comprising receiving the first query, obtaining the identifier of the searcher, and the performing of the search of the data in response to the first query and the identifier of the searcher, and a second search phase comprising the selecting of the first subset of terms, the computing of the correlation weights, the selecting of the second subset of the terms in the first subset of terms, and the producing of the second query.
2. The method of claim 1, wherein at least a portion of the result data includes content involving the searcher identified by the identifier, and wherein the content involving the searcher includes at least one selected from among a data item authored by the searcher, a data item received by the searcher, and a data item produced based on participation of the searcher.
3. The method of claim 1, wherein receiving the first query includes receiving the first query responsive to the search input including a search string entered by the searcher.
4. The method of claim 1, wherein receiving the first query includes receiving the first query that contains terms of an input data item for use in identifying at least one result data item that is similar to the input data item.
5. The method of claim 1, further comprising: ranking data items in the result data according to relevancy of the data items to the search performed in response to the first query; and selecting the subset of the data items in the result data based on the ranking.
6. The method of claim 1, wherein the identifier of the searcher is obtained independently of a search input provided by the searcher to produce the first query.
7. An article comprising at least one non-transitory machine-readable storage medium storing instructions to perform multi-phase personalized searching, the instructions upon execution by a computer system causing the computer system to: perform a first search of a corpus of data based on a first query containing a search term, and based on an identifier of a searcher that submitted the first query; rank data items in result data produced by the first search, the ranking according to relevance of the data items to the first search; select a subset of the data items in the result data based on the ranking; select a first subset of terms in data items of the subset of data items; compute correlation weights for corresponding terms in the first subset of terms, where each of the correlation weights is based on correlation of the corresponding term to data items in the subset of data items; select a second subset of the terms in the first subset of terms according to the correlation weights; produce a second query that contains the terms in the second subset of terms; and perform a second search of the corpus of data based on the second query, the second search retrieving data items matching the second subset of terms in the second query from public websites available over an Internet.
8. The article of claim 7, wherein the terms in the subset of data items include at least one term in a data item received by the searcher.
9. The article of claim 7, wherein the first query contains the search term entered as a search string by the searcher.
10. The article of claim 7, wherein the first query contains search terms extracted from an input data item.
11. The article of claim 10, wherein the second search produces further result data containing data items that are similar to the input data item.
12. The article of claim 7, wherein the first search produces the result data from public websites available over the Internet based on the search term and the identifier of the searcher.
13. A system comprising: at least one processor to perform multi-phase personalized searching comprising: a first search phase comprising: receiving a first query to search data, the first query including terms of an input data item, where the first query is to identify result data items that are similar to the input data item, and performing a search of the data in response to the first query, wherein the search produces result data based on the terms and an identifier of a searcher submitting the first query, where at least a portion of the result data includes content involving the searcher identified by the identifier; and a second search phase comprising: selecting a first subset of terms in data items of a subset of data items in the result data, computing correlation weights for corresponding terms in the first subset of terms, where each of the correlation weights is based on correlation of the corresponding term to data items in the subset of data items, selecting a second subset of the terms in the first subset of terms according to the correlation weights, creating a second query containing the second subset of terms; and returning further result data produced by a second search responsive to the second query, the second search performed after the search responsive to the first query.
14. The system of claim 13, wherein the second search retrieves data items matching the second subset of terms in the second query from public websites available over an Internet.
15. The system of claim 13, wherein the identifier of the searcher is obtained independently of a search input provided by the searcher to produce the first query.
16. The system of claim 13, wherein the search in response to the first query is performed using the terms of the first query and the identifier of the searcher against data sources including public websites available over an Internet.
17. The system of claim 13, wherein the at least one processor is to further: rank data items in the result data produced by the search in response to the first query, the ranking according to relevancy of the data items to the search performed in response to the first query; select the subset of the data items in the result data based on the ranking.